This folder contains the beta version of the data release for:

Guzman, Jorge, and Aishen Li. 2019. Measuring Founding Strategy.

It includes all data created in the above-mentioned paper.  Code to 
replicate the data production process is also available in the 
GitHub repository:

http://www.github.com/measuring-founding-strategy



Folder Overview
---------------

The data release is divided into six folders:


Folder 'doc2vec and tfidf models':
	Contains the actual NLP models created through both doc2vec and 
	tf-idf for each year of Crunchbase startups.  These models can 
	be used to assess the similarity of any new startup to this
	population, to startups in other populations in the same cohort,
	or to other firms.
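
As a rough illustration of what a tf-idf similarity comparison involves,
the sketch below scores a new startup description against a small
population.  It does not load the released model files (their format is
not described here); the toy texts, the whitespace tokenizer, the
smoothed idf formula, and all function names are assumptions made for
illustration only, not the procedure used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(corpus):
    """Compute smoothed inverse document frequencies from a list of texts."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for text in corpus:
        doc_freq.update(set(tokenize(text)))
    # Smoothed idf: log((1 + n) / (1 + df)) + 1, so no term gets weight zero.
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Map a text to a sparse tf-idf vector; terms unseen in the corpus are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy population standing in for one year of startup website text.
population = [
    "online marketplace for used bicycles",
    "cloud software for hospital scheduling",
    "mobile app for bicycle repair booking",
]
idf = fit_idf(population)
population_vectors = [tfidf_vector(text, idf) for text in population]

# Score a hypothetical new startup against every member of the population.
new_startup = tfidf_vector("marketplace for bicycles and repair", idf)
scores = [cosine_similarity(new_startup, vec) for vec in population_vectors]
```

Here the idf weights come from the population only, so words in the new
description that never appear in the population are ignored.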


Folder 'hp industries model':
	Contains the model for HP industries, created by running k-means 
	clustering with 300 clusters on the tf-idf distance of all website 
	code in our data.  This is intended to replicate the approach of 
	Hoberg and Phillips (2016) within our data.
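
For readers unfamiliar with k-means, the following is a minimal sketch of
the clustering step on a handful of toy vectors.  It is not the code
used to build the released model: the use of Lloyd's algorithm, the
deterministic initialization, the unit-length normalization, the two
clusters (instead of 300), and the toy data are all assumptions chosen
to keep the example small.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so Euclidean distance tracks cosine distance."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; centroids start at the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical tf-idf weights over two terms for four websites.
raw_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
points = [normalize(v) for v in raw_vectors]
labels, centroids = kmeans(points, k=2)
```

On unit-length vectors the squared Euclidean distance equals
2 - 2 * (cosine similarity), which is why the sketch normalizes before
clustering.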


Folder 'most similar websites list':
	Contains a file listing, for all startups in our data, the five most
	similar public firms and the five most similar startups, as well
	as the first 500 characters of the downloaded website text for each
	startup and public firm.
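
A listing of this kind can be derived from pairwise similarity scores by
keeping the highest-scoring matches for each startup.  The sketch below
shows one way to do that, assuming the text excerpt is the leading 500
characters; the firm names, scores, and function names are hypothetical
and do not reflect the layout of the released file.

```python
import heapq


def top_k_similar(scores, k=5):
    """Return the k (name, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


def text_preview(text, n_chars=500):
    """Keep only the leading n_chars characters of a website's text."""
    return text[:n_chars]


# Hypothetical similarity scores between one startup and seven other firms.
scores = {
    "firm_a": 0.12,
    "firm_b": 0.57,
    "firm_c": 0.33,
    "firm_d": 0.91,
    "firm_e": 0.05,
    "firm_f": 0.48,
    "firm_g": 0.76,
}
top_five = top_k_similar(scores)
preview = text_preview("x" * 800)
```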


Folder 'similarity scores':
	Contains the estimated similarity scores used in the paper.

Folder 'website text':
	Contains all website text, by year.

 
Folder 'stata scripts':
	Includes all scripts used to build this paper, such as the main
	regressions, and the code that can be used to build the analysis
	file if the user has access to the Crunchbase academic files.




Contact
-------

For questions, please reach out to:

Jorge Guzman
jag2367@gsb.columbia.edu


Aishen Li




